Guide on optimizing Parquet file reading to reduce memory usage #594
haianhng31 wants to merge 3 commits into G-Research:master from
Conversation
adamreeve
left a comment
Nice start, thanks @haianhng31!
Can you also please add a link to the new guide as well as the visitor pattern one to the list at the bottom of index.md?
> APIs for reading Parquet files:
> 1. **LogicalColumnReader API** - Column-oriented reading with type-safe access
> 2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
The Arrow format is still column-oriented
```diff
- 2. **Arrow API (FileReader)** - Row-oriented reading using Apache Arrow's in-memory format
+ 2. **Arrow API (FileReader)** - Reading using Apache Arrow's in-memory format
```
> Each API offers different memory management options that impact memory usage.
> ## Memory Configuration Parameters
This should include a section on the buffered stream parameter (ReaderProperties.EnableBufferedStream). Maybe this could be combined with the Buffer Size section as the buffer size is only used when the buffered stream is enabled? It would be helpful to also link to the documentation for the relevant methods for setting each parameter.
> ### 1. Buffer Size
> Controls the size of I/O buffers used when reading from disk or streams.
> **Default**: 8 MB (8,388,608 bytes) when using default file reading
I think the default is actually 16384, where did you get 8 MB from?
> **Impact**: Larger buffers reduce I/O operations but increase memory usage. Smaller buffers are more memory-efficient but may decrease throughput.
> ### 2. Chunked Reading
> Instead of loading entire columns into memory, read data in smaller chunks.
This could do with some clarification. Is this referring to using the LogicalColumnReader API and controlling buffer/chunk sizes yourself?
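If the intent is manual chunking via the LogicalColumnReader API, a sketch could look like this (file path, column type, and chunk size are all illustrative, not from the guide):

```csharp
using ParquetSharp;

// Illustrative sketch: read a float column in fixed-size chunks rather than
// loading it whole, so peak memory is bounded by the chunk buffer.
using var reader = new ParquetFileReader("data.parquet");  // path is made up
using var rowGroupReader = reader.RowGroup(0);
using var columnReader = rowGroupReader.Column(0);
using var logicalReader = columnReader.LogicalReader<float>();

var chunk = new float[64 * 1024];  // 64K values per read; arbitrary choice
while (logicalReader.HasNext)
{
    int valuesRead = logicalReader.ReadBatch(chunk);
    // process chunk[0..valuesRead] here
}
```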
> **Impact**: Pre-buffering can significantly increase memory usage as it loads data from future row groups before they're needed. This is the primary cause of memory usage scaling with file size reported in Apache Arrow [issue #46935](https://github.com/apache/arrow/issues/46935).
> ### 4. Cache (Arrow API Only)
> The Arrow API uses an internal `ReadRangeCache` that stores buffers for column chunks.
I think this can be merged into number 3, as the cache options only apply when using pre-buffering and are used to configure the pre-buffering behaviour.
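For the combined pre-buffering/cache section, a hedged sketch of turning pre-buffering off in the Arrow API might help (property and constructor parameter names here are written from memory and should be checked against the ParquetSharp documentation):

```csharp
using ParquetSharp;
using ParquetSharp.Arrow;

// Sketch: disable pre-buffering so column chunk data for future row groups
// is not fetched and cached ahead of time. Verify names against the docs.
var arrowProps = ArrowReaderProperties.GetDefault();
arrowProps.PreBuffer = false;

using var fileReader = new FileReader(
    "data.parquet",  // path is illustrative
    arrowProperties: arrowProps);
using var batchReader = fileReader.GetRecordBatchReader();
```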
```csharp
for (int col = 0; col < metadata.NumColumns; col++)
{
    using var columnReader = rowGroupReader.Column(col);
    using var logicalReader = columnReader.LogicalReader<float>();
```
I think it's worth pointing out that ParquetSharp has its own buffering in the LogicalReader API, and this can be configured with the bufferLength parameter of this method.
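For example, the snippet quoted above could pass an explicit buffer length (the file path and value below are illustrative only):

```csharp
using ParquetSharp;

// Illustrative: the bufferLength argument controls ParquetSharp's own
// internal buffering in the LogicalReader API; 4096 here is an example value.
using var reader = new ParquetFileReader("data.parquet");  // path is made up
using var rowGroupReader = reader.RowGroup(0);
using var columnReader = rowGroupReader.Column(0);
using var logicalReader = columnReader.LogicalReader<float>(bufferLength: 4096);
```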
| { | ||
| // Use a buffered stream with custom buffer size (1 MB in this example) | ||
| using var fileStream = File.OpenRead(filePath); | ||
| using var bufferedStream = new BufferedStream(fileStream, bufferSize); |
By using a buffered stream, I actually meant enabling it in the ReaderProperties with ReaderProperties.EnableBufferedStream.
In previous investigations I've found that this can significantly reduce memory usage.
I don't think using a .NET System.IO.BufferedStream will change memory usage characteristics much.
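A sketch of what this suggestion could look like, replacing the `BufferedStream` wrapper with the reader's native buffering (the path is illustrative, and the exact property/constructor shapes should be confirmed against the ParquetSharp docs):

```csharp
using ParquetSharp;

// Sketch: enable Parquet's native buffered stream via ReaderProperties
// instead of wrapping the file in a .NET System.IO.BufferedStream.
var props = ReaderProperties.GetDefaultReaderProperties();
props.EnableBufferedStream();
props.BufferSize = 1024 * 1024;  // 1 MB, matching the example above

using var reader = new ParquetFileReader("data.parquet", props);  // path is made up
```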
> - **Columns**: 10 float columns
> - **Rows**: 100 million (1 million per row group)
> - **Compression**: Snappy
> - **Test System**: MacBook (*Note: real-world performance may vary depending on your operating system and environment*)
Closing this as superseded by #611
No description provided.